Corpus similarity measures remain robust across diverse languages

Abstract

This paper experiments with frequency-based corpus similarity measures across 39 languages using a register prediction task. The goal is to quantify (i) the distance between different corpora from the same language and (ii) the homogeneity of individual corpora. Both of these goals are essential for measuring how well corpus-based linguistic analysis generalizes from one dataset to another. The problem is that previous work has focused on Indo-European languages, raising the question of whether these measures are able to provide robust generalizations across diverse languages. This paper uses a register prediction task to evaluate competing measures across 39 languages: how well are they able to distinguish between corpora representing different contexts of production? Each experiment compares three corpora from a single language, with the same three digital registers shared across all languages: social media, web pages, and Wikipedia. Results show that measures of corpus similarity retain their validity across different language families, writing systems, and types of morphology. Further, the measures remain robust when evaluated on out-of-domain corpora, when applied to low-resource languages, and when applied to different sets of registers. These findings are significant given our need to make generalizations across the rapidly increasing number of corpora available for analysis.
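
As an illustration of what a frequency-based corpus similarity measure can look like, the sketch below ranks character n-gram frequencies in two corpora and compares the rankings with Spearman's rank correlation. This is a minimal sketch under stated assumptions, not the paper's implementation: the choice of character trigrams, the `top_k` cutoff, and all function names here are hypothetical and chosen only for illustration.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in a text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def corpus_similarity(corpus_a, corpus_b, n=3, top_k=1000):
    """Spearman's rho over the top_k most frequent n-grams of both corpora.

    Assumption: features are the most frequent n-grams in the combined
    corpora; other feature selections are equally plausible.
    """
    freq_a = char_ngrams(corpus_a, n)
    freq_b = char_ngrams(corpus_b, n)
    features = [gram for gram, _ in (freq_a + freq_b).most_common(top_k)]
    ranks_a = average_ranks([freq_a[g] for g in features])
    ranks_b = average_ranks([freq_b[g] for g in features])
    # Spearman's rho is the Pearson correlation of the two rank vectors.
    return pearson(ranks_a, ranks_b)
```

Under this sketch, higher values indicate more similar frequency profiles. Homogeneity of a single corpus could be estimated in the same spirit by splitting the corpus into halves and comparing them with `corpus_similarity`.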

Similar resources

Measures for Corpus Similarity and Homogeneity

How similar are two corpora? A measure of corpus similarity would be very useful for NLP for many purposes, such as estimating the work involved in porting a system from one domain to another. First, we discuss difficulties in identifying what we mean by 'corpus similarity': human similarity judgements are not fine-grained enough, corpus similarity is inherently multidimensional, and similarity c...

Robust clustering of languages across Wikipedia growth

Wikipedia is the largest existing knowledge repository that is growing on a genuine crowdsourcing support. While the English Wikipedia is the most extensive and the most researched one with over 5 million articles, comparatively little is known about the behaviour and growth of the remaining 283 smaller Wikipedias, the smallest of which, Afar, has only one article. Here, we use a subset of thes...

Robust Similarity Measures for Mobile Object Trajectories

We investigate techniques for similarity analysis of spatio-temporal trajectories for mobile objects. Such data may contain a large number of outliers, which degrade the performance of Euclidean and Time Warping Distance. Therefore, here we propose the use of non-metric distance functions based on the Longest Common Subsequence (LCSS), in conjunction with a sigmoidal matching function....

Journal

Journal title: Lingua

Year: 2022

ISSN: 0024-3841, 1872-6135

DOI: https://doi.org/10.1016/j.lingua.2022.103377